Okay. Hmm so we all attended Interspeech, no? So how many sessions you attended? I mean did you attend any any session? Yeah. That's really that's really your So any general impression or feedback? So f for any of you it's a first conference or maybe for? No, I mean in the sense that it's uh Interspeech is really big conference for speech, so did you attend before? Uh, ok ok yeah, o ok yeah, so then you're hiding a you have attended before yes. Yeah, I see see. Yeah, yeah. I t I Uh I think it's quite interesting, but only annoying thing is this multiple sessions. And most sometimes you can't able to go to oral talk this oral presentations most of the time, because because like posters is you can spend a lot of time posters looking at m many posters than sitting twenty minutes for one oral presentation. So in t twenty minutes you can see at least two, three posters, and you can directly talk to people, so. Even I found like uh very few people in some oral presentations. I think most of the people are like their own posters only. Yeah, how presentations are hardly handful of like f five or more. Yeah. Ah, ok. Hmm yeah. Ah, okay. Yeah. Even some panel discussions on that human speech recognition reducing gap between the A_S_R_ and and H_S_R_. It was a bit interesting, lot of arguments and Yeah, b differently it's mainly the differences different approaches of engineers versus linguists or phoneticians and yeah mm. Ah, okay. Yeah, even the panel discussions w I think one is really held in small room, so people were really crowded. Yeah, it's Yeah,. Yeah. Ah ok we went to like far south, to Lagos and yeah, that wa uh ha yeah. That was really good, those beaches are really good. Yeah, even dolphin in the yeah. But the weather was really hot, the south it's more than thirty five. Uh. Lisbon was good, um little bit bit mm. Yeah, even local transport, it's it's. But yeah, most of the time the buses are really crowded. Uh. Yeah. Yeah. Yeah, because it's really big city, no? 
No like. Mm. Yeah. Hmm. Yeah. Yeah. No, re yeah, but that's that's not good, direct bus is not good like. So you can go to the direct uh ce central place and then occas uh. Mm. Yeah, da Uh what is the place, uh the the central I mean where we change the bus to yeah, in the yeah, maybe you can check the booklet. Yeah it's there, like I used to find bus numbers from book only. Oh, okay. Oh, okay. It's like uh in It's like uh all the semi-tied covariance matrices or yeah. So did they reduce it's in between the diagonal covariance this is full covariance. Should try to reduce some parameters by tying. Oh. Uh. Ah, okay. Yeah. Mm. In your tie-in. No, I think they got good results with just using diagonals or definitely yeah. Yeah. Mm. Maybe decorrelation again they do D_C_T_ or K_L_T_ or they do that? L_D_A_ or like Yeah. So No, but still there is there is still ev that's why like people come Yeah, definitely it's not optimum. So it's it's also like doing along along the diagonals all again. It's really computationally really uh yeah so see suppose if thirty nine in te thirty nine to thirty nine for every yeah, yeah, so Uh there is a thing Yeah. Ne no problem, but if you see so many models and so many mixtures yeah. But Yeah, thirty you can see you c you can't really like even you can't the models itself is like thirty eight times more than that, so. So y if you have like five megabytes or ten megabytes of models then yeah. But even I think it's really bit like impossible for the really big systems I guess, like storing storing itself. Yeah yeah, yeah yeah, it's really ti Yeah yeah, for every you're so you're to store all this thirty Yeah. So from Mm-hmm. Mm-hmm. Speak uh yeah, different task and So uh I found one paper interesting. I think you know Vivek most of you know. Uh this is uh Vivek's paper on like variable scale feature extraction. Normally we use a fixed window for feature extraction, like twenty five milliseconds or thirty milliseconds.
Here he's proposing variable scale, because the fixed scale is non-optimal for it's non-optimum because like for vowels you can have m much longer, and for plosives and these things you really shorter, like even around less than twenty milliseconds also. So he's ki proposing like this variable scale window f uh kind of online for each segment. Yeah mm mm you can do like you can measure and you can do but he's mm basically doing some mm likelihood ratio testing. So if so the main idea is like suppose you have uh one segment, so he's trying to find the stationarity, quasi-stationarity of that segment. Yeah, as much as possible like but yeah, definitely you have to assume some l uh like minimum and maximum sizes of your own, so he is using minimum of minimum is uh I think twe I don't twelve point five milliseconds I think. Maximum is uh sixty milliseconds. So the minimum No no, I think a minimum is twen twenty milliseconds is i yeah. Twelve point five milliseconds is kind of shift. Twenty millis minimum is no no, de because of uh No, you can use mol yeah. No, mainly because of uh M_F_C_C_ computation, because yeah, it becomes really noisy, like if you s yeah, ten milliseconds means you have only eighty samples for uh and then if you use twenty four filters, the filters won't get any samples, so. Yeah, because of computation. So even I I think even st uh he can find less than twenty millisecond windows also, especially for plosives uh. Shifting yeah, he's using twelve point five millise yeah, shifting is almost same. But the problem like He Yeah. Yeah, he's keeping same number of frames. Uh he's using uh twelve point five yeah, twelve point five millisecond Yeah. Number of uh frames? W you want to Actually the problem's again uh uh you see the shift is the Nyquist frequency if you compared modulation spectrum. So again, if you change this shift, the mod the Nyquist the modulation spectrum, Nyquist frequency changes for each window.
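The window and shift arithmetic mentioned above can be sketched as follows. This is only an illustrative calculation: the 8 kHz sampling rate is an assumption inferred from "ten milliseconds means you have only eighty samples", and the function names are hypothetical. The modulation-spectrum Nyquist frequency is taken as half the frame rate, which is why changing the shift changes it.

```python
def samples_in_window(window_ms, sample_rate_hz=8000):
    """Number of samples in an analysis window (8 kHz is an assumed rate)."""
    return int(round(window_ms * sample_rate_hz / 1000))


def modulation_nyquist_hz(shift_ms):
    """Half the frame rate: the highest modulation frequency the
    feature trajectory can represent for a given frame shift."""
    frame_rate_hz = 1000.0 / shift_ms
    return frame_rate_hz / 2.0


if __name__ == "__main__":
    for window_ms in (10, 20, 60):
        print(window_ms, "ms window:", samples_in_window(window_ms), "samples")
    for shift_ms in (10, 12.5):
        print(shift_ms, "ms shift:", modulation_nyquist_hz(shift_ms), "Hz")
```

With a fixed shift the frame rate, and so this Nyquist frequency, stays the same whatever the window length; only a variable shift would change it per window.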
So if you want to do again another high-level feature extraction again, it will be problem, so. So but he's uh like he's he discuss I mean describing this, because even this itself is a problem, keeping the f uh fixed frame size, because you're analysing your this uh segment many times. So this shift will suppose if you find one segment of sixty milliseconds, and then you're doing this ten millisecond so every almost I think yeah, mm uh yeah, five frames y you're analysing the previ this already segment so. So again, this may blur some frequency transitions or yeah, but the problem is again this uh bottleneck is here, like you can't change your Nyquist frequency modulation and Yeah. Yeah. Yeah. So it's so uh so the main idea is like first to take the some some window, some sam some samples, then he assume somewhere some variable and point there is a change. Then then it's not symmetric, i you can cho uh y it's basically uh too like again, do for all the samples in that zero. So then what he does is like he proposed uh like some likelihood ratio test based on maximum likelihood. So th tha that is really simple, like so what he does is like so he he computes the residual, he first he does the L_P_ analysis and then he computes the residual, and he takes the residual energy of f full signal like zero samples. The full signal. So maybe I can The wh the whole window. No, that's that like he assumes yeah, sixty milliseconds or something he start with, so so then he can yeah, yeah. Yeah, then he can split that window at point N_ say point N_, so then you will have zero to no, you can segment it maybe. Yeah, so yes, suppose this is your p window. So then you can move your like this. You can move your point, so that then this will be like one and this will be second. But again, he assumes some initial uh sizes for these windows. Yeah. He don't start with the yeah, he don't s yeah, start with zero sample or something.
He's starting with uh so the l left window is starts at twenty milliseconds, so so you already hear like s suppose this is full signal, you're already here. So right window is should end at twelve point two, so you're to search basically in this range. So Yeah. Yeah. Yeah, and then if you the maximum is sixty millisecond frame for him. So if you don't find anything Sixty milliseconds. If you d No, the point is depending on the likelihood ratio. So he got so suppose this so this is suppose se segment one, and this is segment two. So he compute this error here. So and then he p gives some kind of in terms of this residual error. So this residual error is for full zero samples. Then there would be suppose this point is say N_. So zero to N_ and then like So this the b basically the main he questions or here Yeah, he's for finding the likelihood for full frame, and then it's basically likelihood ratio test, so he's comparing the likelihood of the full segment divided by the likelihood of the the sub-segments. Likelihood of the uh like this likelihood is uh this error estimate of the residual. So residual power he can compare t so he after do after doing L_P_ he can get a l residual and then he can compare the power of the residual. So that he's proving that power is again maximum likelihood estimate of your L_P_ parameters. So that which interesting it's you don't really need to do a lot of computation, you just need to take the error of n uh residual of this full window and then residual of energy of the sub-windows and then you can just divide them and then the only thing is like again he has to use some threshold to decide, that that's only the p problem. He got uh he's finding something more uh around three or three point five as the optimum threshold, but the and another advantage is like it's not really changing speech recognition, whatever, so not really changing because of the threshold, so.
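The residual-energy comparison described above can be sketched roughly as below. This is not the paper's exact statistic: it uses the textbook Gaussian log-likelihood ratio for linear prediction residuals, so its scale will not match the threshold of about three mentioned in the discussion. The function names, the LP order, the search step and the synthetic AR(2) test signal are all assumptions made for illustration.

```python
import math
import random


def lp_residual_energy(x, order=10):
    """Per-sample LP prediction error (autocorrelation method,
    Levinson-Durbin recursion)."""
    n = len(x)
    r = [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(order + 1)]
    err = r[0]
    a = [0.0] * (order + 1)
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return max(err / n, 1e-12)


def log_likelihood_ratio(x, n, order=10):
    """Large when splitting x at sample n explains the data much better
    than one LP model for the whole window."""
    left, right = x[:n], x[n:]
    return 0.5 * (
        len(x) * math.log(lp_residual_energy(x, order))
        - len(left) * math.log(lp_residual_energy(left, order))
        - len(right) * math.log(lp_residual_energy(right, order))
    )


def best_split(x, min_left, min_right, order=10, step=8):
    """Search candidate change points, keeping minimum sub-window sizes."""
    candidates = range(min_left, len(x) - min_right + 1, step)
    return max((log_likelihood_ratio(x, n, order), n) for n in candidates)


def synth_ar2(a1, a2, n, rng):
    """Synthetic AR(2) signal, used here only as a stand-in for speech."""
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(a1 * x[-1] + a2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


if __name__ == "__main__":
    rng = random.Random(0)
    # 60 ms at an assumed 8 kHz: two 30 ms halves with different spectra.
    window = synth_ar2(1.5, -0.9, 240, rng) + synth_ar2(-1.5, -0.9, 240, rng)
    score, n = best_split(window, min_left=160, min_right=100, order=2)
    print("best split near sample", n, "with score", round(score, 1))
```

A decision rule would compare the best score with a threshold and, if nothing exceeds it, keep the maximum window; how the threshold is set is left open here.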
Uh you don't really need to f like fiddle with threshold a lot, so mm. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah, this is yeah, I understand. This is this is the actually basically some theoretical proof, like maybe if you want threshold you can see and he's quoting from uh Leven like uh. No, this is from again yeah, statistical signal processing. What they say here is like suppose if you analyse two distinct or uh this L_P_ analysis in the same stationary analysis window, the coding error will always be greater than the ones resulting from this analysis in two windows. Two stationary windows. So the he's basically based on this theorem, so so what he's saying is like if you do this error, it will be always greater than the mm th these errors of two stationary windows. Yeah. Yeah. Yeah yeah yeah yeah. Yeah, this maybe I think yeah, this he's getting some improvement, but Yeah yeah yeah yeah. So he's just uh uh he's normalising the energy coefficient, because energy is really affected a lot. Uh so he's li Energy like, because C_ zero component like, so he's using M_F_C_C_ so he's normalising the C_ zero component for this t yeah. Power is again like yeah yeah yeah yeah, it's time like yeah. It's okay like samples, or you can see root mean square energy or something. Yeah. Yeah. No. No, n no, you vo what you mean like the window length. The sixty mil no, it i it doesn't really Uh but I think it's he he doesn't use I guess, because uh uh then like you just you take these features and you train model, so. It really matters wh what frame shift you're using for models, like because how many he needs to use same freq No no, wait. No, he will get like every ten milliseconds he get one fra one features, like he it doesn't really depend, because suppose if you take your t case, like you your window i maybe longer, so doesn't really matter like how what size window you use or not. But I tell uh at every ten milliseconds whether you give some features or not really matters, no? Like so.
So it doe you can use uh thirty milliseconds or fifty millisecond window. Same. Yeah yeah yeah. Yeah yeah yeah yeah yeah. Yeah yeah, framing it's like it's really yeah. Yeah, if you have if you're change framing, it's really I think even it's really problem, no? Like if you try even models also I guess, if you change. It's not only with this model, it's in spectra and shifts in deltas, because Yeah, but it's really complicated I think. If you use yeah, that becomes really Yeah. But this is kind of interesting. Yeah yeah yeah, the temporal dec. Mm-hmm. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah, first this is proposed by like uh even we read one paper in our reading group, you remember that? I have implemented, it's quite working well, like uh yeah. Mm. Yeah. Yeah, Honza is also worked for his be yeah. Uh i yeah. Ne he was re he was quoting But that Yeah, he was uh quoting that paper also, like v he was telling this uh this is kind of optimisation criteria what this temporal decomposition is doing like. You're trying to segment your s signal into like discrete windows and then so but he was telling like this uh relationship between optimisation and and then quasi-stationarity is not really obvious. So stationarity is again different, no? Stationarity is this is optimisation, we want to like see some signal, few segments, which can really represent whole, so maybe that's why like this may not be really Yeah. Yeah, this one yeah, he's getting some improvement n uh this is a, yeah. No no no, they again But again uh, getting that ten papers is really difficult task.
It takes maybe years and again uh it definitely like who will who will choose the ten papers like, and if we asked yeah, but yeah, those things are practically kind of really but I think this kind of things what you showed I mean what you said is really good, like if people start implementing like some people propose something in feature level, some propose in something in model level, but these two are really independent, no? When really work combine these two or Yeah, especially features, like suppose if somebody come with different good features, again people st again use M_F_C_C_s No no, it it's not really sure also, make suppose uh people you can't really force people to use same features, so so they wo they will be happy with their own features and their own scripts, so their bit rate like tend to change features every time and it's Hmm. But even NIST you can Yeah. Yeah, yeah. Then So at lea especially L_V_C_S_R_ in really big systems, then people Yeah. Yeah. Because yeah, it's it's again you don't know. Oh. No, then uh there will be only one conference in three years or something for speech, because But again like That's so you can publish only once in three years or Yeah. But even people are doing suppose in Interspeech there was uh some challenge for speech synthesis, so so what's is uh like it challenge was really good, like sup they give the database, so you have to work on that database. What technique t use is it's your choice like, but that database and then the results analysis is they decide, so. Even speech recognition also some tests are coming, like for phone segmentation or something. People give some database and then you have to dis you can use whatever you want and you have to produce even for features also like features also I think Yeah yeah, the they they designed the data such a way that it's it's really like really real uh data. 
No no, for s nay, speech synthesis it's like they give you some data, so whether like you use for training or l you can you can do like data driven or like model based approach or whatever like. At the end then they will ask you to synthesise some sentence and you have to synthesise them and then you have to send them Your training data, onl no development, training, no yeah, training Evaluate it's again subject to s uh because for speech synthes yeah. Yeah, listen and yeah d yeah, even uh they it's mostly they use native speakers only, because they can really judge well, so. Yeah yeah yeah yeah. Yeah, yeah yeah. So even they propose some kind of word error rate. So you have the original speech, so i the synthesised speech, is there any words which are not matching? So even speech synthesis, they're also uh introduced this W_E_R_ term, so. Uh. But this was really quite successful like in this conference, Interspeech. So a lot of people participated in uh I think even they're continuing this for uh next year and Text to speech. But uh I heard like even for I_CASSP next year there is some com uh like competition by Martin Cooke, Sheffield, and on this feature extraction stuff and so at least if you make task simple focus, then it may be good to compare these features and but if you s use some uh large vocabulary system, it takes t six months to build and at the end you don't know whether your features are really uh so. So maybe that's why like people are always using P_L_P_s or M_F_C_C_s, it's it's such a b like lot of time involved, so you can't really check many thi Yeah, yeah yeah yeah. But at least in IDIAP we have this numbers recogniser is almost kind of free, so everybody is using so that's what we are doing, no?
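The word error rate mentioned above for the synthesis challenge can be sketched as a word-level edit distance between what was meant to be said and what listeners wrote down. This is the usual recognition-style definition, assumed here rather than taken from the challenge rules; the function name and example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    intended = "the cat sat on the mat"
    heard = "the cat sit on mat"
    print(word_error_rate(intended, heard))  # one substitution, one deletion
```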
Almost almost like we are putting different features and I think yeah, yeah, numbers recogniser is not really different from like uh a f at least in IDIAP we have almost same, but maybe you are using twenty nine, that's for you it's different, but one more But at least i twenty seven. Hemant and we're all using twenty seven, so. Yeah. Yeah. No, but the problem is you're using only digits, so maybe that's how you No, but test set is digits, your m main task is digits k yeah, that's uh like but O_G_I_ numbers are there like now we're little bit confused, like we're using twenty seven phones and before it was like twenty four, twenty five. But it's better, like once you have like whatever the back end, then you can put features like whether they're gammas or like spectral entropies or M_F_C_C_s or whatever. Then at least you can see Uh? Yeah yeah yeah. But Aurora is Aurora task. So but Aurora, is it really big database, or I mean how much time it takes to set up system and then It's fast, no? But the problem is Aurora again, these models are word based models, no? Yeah, so again here we use triphones and so but definitely, I d if we want to show noise setting, it's you have to show on Aurora also, like Aurora is real time. Yeah. Yeah. Yeah, yeah yeah. But yeah Yeah, at least I mean wha those studies I think people already did for even in I_C_ I_C_S_L_P_ two thousand two there is special session on features for Aurora, so yeah, yeah, i yeah yeah. This ICSI features and this uh s uh s uh O_G_I_ features and all this. So there are already people but again like uh, at the end like for L_V_C_S_R_ people are using P_L_P_s or M_F_C_C_s or none of these fancy feature Yeah. New But maybe my have a f for publishing another paper or something. Mm. Yeah. But maybe Guillaume, you can tell, no? Like uh you worked i with these Siemens people or no, Siemens or these DaimlerChrysler. What kind of features do you use, like do you use um any you don't yeah, you don't Ah ok okay.
Okay. Ah ok Yeah, even s yeah. I didn't discuss anything with Sunil also, like I dunno, technical discussions, anything. these kind of things are a bit, yeah. Yeah, yeah. But do you have any like feeling that whether they go for all these new features or they usually use only mm. Maybe yeah. Mm-hmm. Hmm. But even they don't believe like with one paper or two I guess, because at least for them no P_L_P_s or M_F_C_C_s, they know that okay, these things work on every task, so okay, we can pu put hands on these things yeah. Huh? Yeah, yeah yeah yeah. Maybe like yeah, f because Sunil is working in Nokia, he's a expert in kind of features or yeah, it l I think. But yeah, we're not sure, even Sunil won't tell us. No, no n no, this uh Cohen, you mean like uh uh Hmm. Oh ok Huh. Uh. But this was signal company, no like? This voice signal company, Hy Hynek always mentions uh Cohen's company with what they're using, you n yeah. Voice no no, voice signal. They make uh recognition for these Samsung phones and yeah. Uh they have like really s many recognisers I guess. Maybe definitely they may be using P_L_P_s or you know. Yeah RASTA or RASTA or something you mean. Mm. Yeah, because mm. Yeah, but again Yeah, too new is really definitely because like want to at least see i at least four, five years of results of some new features and some consistent results or Ah. Uh space or ah, okay. Yeah yeah yeah. Mm. Yeah. But yeah, d Yeah, uh another issue is like again the computational issues, no? If they want to make on mobile, they don't want really use some fancy, really expensive computationally expensive features or all these Yeah yeah, it It's just phone recognition or something like Yeah. But most of the time mobiles, they use D D_T_W_ kind of thing, no like? All these uh so voice calling yeah, voice calling and these thing, because that's easy like, because they don't really need to put lot of uh memory and these things, otherwise if they want to build a real Yeah yeah, yeah. 
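The D_T_W_ matching mentioned above for voice calling on mobiles can be sketched as plain dynamic time warping between a stored template and a new utterance. The one-dimensional sequences and the absolute-difference local cost are simplifying assumptions; a real system would compare feature vectors such as M_F_C_C_ frames.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # step both
            )
    return cost[n][m]


if __name__ == "__main__":
    template = [1, 2, 3]
    slower_utterance = [1, 2, 2, 3]
    print(dtw_distance(template, slower_utterance))  # same shape, different speed
```

Only the templates need to be stored, one per name, which is why this kind of matching needs little memory.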
Where where? Mm. Yeah yeah. Yeah, Cohen, yeah, yeah. It's Samsung, some some new really new phone like, mm. Yeah. But uh the software is always available or like No, the software is again on top of mobile, or like it comes with your mobile No, the thing is usually these kind of these kind of facilities are like no not normal, so so they may charge more or like yeah, yeah. Yeah. Ah, ok yeah, definitely they may charge for more money f again, some extra bill for this using this recognition engine or something. Yeah. Yeah. Yeah. Yeah, but maybe like if it can't really recognise you it will be really annoying it's better. Yeah, I have used once like this dragon, that was kind of good like, yeah. Because even in when I was working in uh like in Edinburgh like, one guy, he use i he got some problem with hands and then he was always using this n uh dictation even he used to write lot of C_ plus plus code with the dictation like. It's really good like, so. But you ought to really train and then you're to kind of uh yeah. I get used to the system. Yeah yeah, it's really like you ought to get used to the system, like you how to say all the callings and this thing system how t but it's really funny like, he is able to use it for like like years, I mean he used to write a lot of code and, you know, it's really great. Yeah yeah, you can listen Festival. Read the like Mm yeah. Yeah yeah, Fes yeah, I think I think you can do the Festival, you can just call like uh this there are different kind of again, how you call the Festival, so you can just give the full there is an too. So then you will get the all the caller and other things. If if you want to like read full text, then you can say like the full text, like those syntax and then it will speak till it finishes the text. Yeah, it's real it's yeah. Diphone, yeah, yeah. That's why it's real time, like otherwise it's bit difficult and it's okay, you can easily understand, so it's intelligible, so. 
Uh Festival is good, like now it comes at all the Linux boxes also, so in so already in this No the there are so many voices, like again Which voice you want to use. No, most of the the Festival, uh what you get on Linux machines, is diphone based only. So they supply few voices. Yeah yeah, yeah, just uh concatenation of the no, it's already there, if you just put li Festival, it comes wi because it's comes with direct Linux software and you know, it is open source. Festival is open It's Yeah, cat voice, like there are like they they have few Yeah, new voices like, yeah. Yeah, that's the difference that tha in views this unit selection. It's not diphones, you have really large inventory of the speech. Uh those voices, they're not yet released. Yeah. No no, it's free, they'll be releasing soon I guess. So maybe you can just check no, i it's it's here on the web, then it won't come with your Linux software, so you have to again download from uh their web page. Yeah, you can download, now. Yeah, yeah yeah. It the voices are called Multisyn, Multisyn uh sl uh that's some voice. So you can download from the voice from the web Festival. Yeah yeah yeah, yeah. Yeah yeah, yeah. So voice is like the database of, the parallel speaker who speaks that this text, so. It was really large database, so it the voi the quality will be better because you can find similar yeah, it's more or less but not exactly like uh it's two times or one point five times or something. But you can prune like how much you want, because so this is again like uh that's kind of my P_H_D_ work how in this system like how you choose the units and how you concatenate. So there are again cost functions and Viterbi. So you can you can always optimise, you can put some thresholds and pruning and then you can make it real time. So then again, compromise uh between the quality and yeah. But still it's good like now. Th you can just check Multisyn and then voices. Uh yeah, yeah, it's really good. 
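The unit choice with cost functions, Viterbi and pruning described above can be sketched on a toy problem. This is not Festival or Multisyn code: units are reduced to single numbers, the target cost and join cost are plain absolute differences, and the function name, weights and beam option are assumptions for illustration.

```python
def select_units(targets, candidates, join_weight=1.0, beam=None):
    """Viterbi search for the cheapest candidate sequence.

    targets[i] is the desired value at position i; candidates[i] lists the
    database units available there. Returns (total_cost, chosen_indices).
    """
    # One (cost, path) entry per candidate at the current position.
    prev = [(abs(c - targets[0]), [j]) for j, c in enumerate(candidates[0])]
    for i in range(1, len(targets)):
        keep = range(len(prev))
        if beam is not None:
            # Pruning: only extend the cheapest `beam` partial paths.
            keep = sorted(keep, key=lambda k: prev[k][0])[:beam]
        cur = []
        for j, c in enumerate(candidates[i]):
            target_cost = abs(c - targets[i])
            best_cost, best_path = min(
                (
                    (prev[k][0] + join_weight * abs(candidates[i - 1][k] - c),
                     prev[k][1])
                    for k in keep
                ),
                key=lambda item: item[0],
            )
            cur.append((best_cost + target_cost, best_path + [j]))
        prev = cur
    return min(prev, key=lambda item: item[0])


if __name__ == "__main__":
    targets = [1.0, 2.0]
    candidates = [[1.0, 5.0], [2.0, 5.0]]
    print(select_units(targets, candidates))
    print(select_units(targets, candidates, beam=1))
```

A smaller beam searches fewer paths and so runs faster, at the risk of missing the best sequence, which is the quality versus speed compromise mentioned above.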
Yeah yeah yeah yeah. But only sometimes like it can't really call this yeah. Uh standard text. Yeah. Yeah, ASCII file is ASCII and but only the problem's again acronyms and sometimes it expands, sometimes it may not expand, because it's not in their dictionary, so yeah. It's not in the dictionary, like if we again it's problem like, so. But those things you can add, like if you're really familiar with Festival, so you can add always these Yeah. But at least these things okay like, it won't really make mistakes so often like, sometimes Alan Black and Paul Taylor, they started r Festival in ninety six or something. Edinburgh C_S_T_R_, yeah. Uh f oh uh this ma this Johan Wouters or like no, they d what they did Portland O_G_I_, they did L_P_C_ based synthesis. The Festival is alrea already like it's one first you start with th Ah, okay. Mm. From. I don't know. No no, actually the b Ah, okay, okay. Yeah, maybe, because Ah, ok Hmm. Yeah, first they started in C_S_T_R_ with Alan Black and Paul Taylor, then then it expand to many places, because if somebody is fil no, Simon King is he's one of the authors, but Rob Clark is the man that you met um. Uh uh, already. Yeah. It's So anything uh more about Lisbon or Interspeech? I it's enough, yeah. Okay.
Okay. So you start I think, right? Yeah, so we are going to talk about the papers from Interspeech. This is not a good question, I think. All of them, of course. All of them, of course. Oh yeah, that's true. Okay. Bad answer, right? Like Yeah, I it's not every year, right. Yeah, we met in also in I_C_S_L_P_, right? The first time, so. Yeah. So how how did you like the conference anyway? When in compared to to th the others, the previous one. But Yeah. Right. Oh yeah. He was from Australia, right? The guy yeah. Uh he's very famous I think for that, right? He's the the inventor of that implant and for for the ear I think. S yeah. Mm. So you went to a beach? To a like Atlantic sea coast or yeah? Uh-huh. Uh it's not so far from Lisbon, half an hour, mm-hmm. Uh northwest mm. Yeah. Yeah. But I think Lisbon those days are pretty good, like seem to me quite not cold but still okay, like reasonable. Oh yeah. Yeah, I was happy to be back in Suisse after few few days it's better to be here I think. Yeah, it's big city, many people and Yeah, yeah. Those new only, right? Not the old one, but yeah. Yeah. Ah it's not necessary to have it, maybe. Some time, yes, but not so many days or Oh yeah, yeah, that's true. Conference. Yeah, yeah. It's Better to ta Yeah. The easiest way. I don't remember the name. Is it there? Yeah, i it's supposed to be there. Oh, you've got the s Yeah. Uh-huh. Like No, I don Right, mm-hmm. Yeah. So they obtain better results with that finally or But Right. But if you compare no, if you compare it with some baseline, let's say you use just G_M_M_ with diagonal covariance matrices A and no decorrelation before like uh Yeah. Or L_D_A_ is there or Yeah. But but there must be the sense, right? Because once you decorrelate the data, then you don't need to have full I know, that's so that's that means that the decorrelation is not uh optimal, right? So so that's the reason. Mm-hmm. Yeah, yeah. It is, yeah. Uh no, there there is not Yeah. Right. Right, right. 
Yeah, yeah. Yeah, yeah, definitely. It's almost impossible for like L_V_C_S_R_, impossible to use it, right? You've got dimension thirty nine times thirty nine? Yeah? So yeah. It's impossible to use it. And you cannot even train it, right? More or less I think. Oh, so that's interesting. Mm-hmm. Much longer. Shorter, yeah. Right. Wh what is the minimum and maximum? So it's not so mm Two point five milliseconds. Twenty milliseconds? From Oh, yeah. Right. Sure. And the shifting is still the same, or not? O yeah, yeah, overlap. So Like um no, how many frames per per second you do have? Is it again like uh one hundred frames, or it's less or even if you keep variable length of the frames, right? You can still keep the same frame rate. Because I understand this would have sense for, I don't know, speech coding where you want to preserve uh or you want to encode it into less frames, right, but why do why to do it for speech recognition? It means that uh Yeah, no, wha why to keep variabl variability in the length of the frames? Wh why not to keep the same length, what is the Yeah. Right. Right, right, right. Mm-hmm. Right, right, right, ri right. Mm-hmm. Mm-hmm. Mm-hmm. Right, yes. Right, yeah. Mm-hmm. Right. Mm-hmm. Mm-hmm. Mm. Mm-hmm. Well Mm-hmm. Yeah. Yeah, probably. No, I don't think so. There is no information. Hmm. Yeah yeah yeah, right, right, right. Yeah, yeah. Yeah, it might be difficult. Mm mm-hmm, probably. Who knows, maybe it's possible to use it somehow, but then there is uh information about temporal information is somehow included in such yeah. Yeah. There is also some um like uh not paper, but I saw the algorithm I have been using even that called temporal decomposition, I don't know if you kn you know that like Uh I don't know who propose it, I just had it from from Bimbot for like a Bimbot, do you know the guy?
He's French um somewhere now in teleconference like mm I don't know where he's working, but th we were using it for uh speech coding, so we just had a speech and you had decompose a speech into such segments, temporal segments which were stationary inside. And then you can more or less uh quantise those segments somehow and use it for encoding let's say, or and it worked prett yeah, yeah. They were using like S_V_D_ stuff, singular value decomposition for that and it worked pretty well, like I was surprised like. Yeah, yeah, exactly, uh he was using that. Yeah. Oh well, it's it has been used for speech coding. Nob nobody use it for recognition stuff I think, I never heard that. No, I don't of course. Yeah, that's good question. Maybe. Right. Right. Mm-hmm. Yeah, that's true. Mm-hmm. Mm-hmm. Mm-hmm. That's true. That's important. All the papers. Okay. I think people wouldn't like it. Many people wouldn't like it. Right. Yeah, yeah, yeah sure. Exactly. Right. And everybody's uh like uh using different training and testing data and then you don't know if like They show it works for TIMIT, but then you try to use some different databases, you see it doesn't work. Yeah, sure. Yeah, yeah, you can. Yeah, exactly, yeah yeah yeah. NIST evaluations. Yeah, but nobody you know, it's those new systems are still not general, like you know, like those P_L_P_ and M_F_C_C_s, because everybody knows that di it worked somehow, right, for any kind of data. That's why they're comparing to to that, so. Once somebody will come with something new, okay, he he's showing it works for some datas, but still yeah, yeah. First it's difficult to to show that it works for all all the datas, right, because it's really and Yeah. Yeah, that's your choice. Mm-hmm. No no, i just for Yeah. I think so, yeah, probably, mm. Mm-hmm. Mm-hmm. Ri I It's just T_T_S_, right, text to speech? Right. Mm-hmm. Right, right. Are they? For O_G_I_s? For O_G_I_ numbers? Different setups? 
Like some people use this setup and some people that? Uh-huh. So you just Oh yeah. No, uh it's. I dunno, I I've been working with that and Right, yeah, yeah. For noise conditions, yeah, it's pretty good. Well, th those f recognisers are are made there or just you just use them, that's it. You don't have to play with it. She even is f it was forbidden in the time, right, so you have recogniser, just use it. Don't play with it You just play with the features. Yeah, right, exactly. So. Yeah, yeah. Ri right. Yeah, yeah, I remember. With David Pearce he was. Yeah yeah, exactly. We are using those results more or less. But not everybody go. Not everybody. Many people stay, so. Yeah. Yeah, sure. But exactly. No. Well, many people play with Tandem, right, like M_I_T_ or who is using that some not M_I_T_, but somebody I dunno, Hynek told told us that somebody's using that um very actively, like okay, it it works for them and and they are close to like industry like uh you know, there is it's a research group of I know, I don't think it's ICSI. Somebody else or I don't know, somebody new uh s but I think some t somebody like A_T_ and A_T_ and T_ or what? Oh qu yeah. You me I don't know, he's even if Qualcomm? No. Now y oh yeah yeah. For right, right. No, I don't know if it's Yeah, I think so. Yeah, definitely. Maybe they are using P_L_P_s and but I th I believe that there is still some stuff up to yeah, for RASTA or whatever else, like all those decorrelations and transformations. I think it's there, but more or less it's again based on M_F_C_C_s or something like that. Or maybe traps. No, more or less I think yeah. Right. That's true, it's true. But that's different situation, because right? They need that it doesn't create that much, yeah. If something happened with the machine and uh I don't kn the alloc system, it doesn't matter, right? Somebody will fix it and change it, but Sure. Yeah. Even even not Viterbi, just even something simpler. Yeah. Uh I don't know. 
Do you think so? Yeah? Oh yeah yeah, yeah yeah, that's true. Yeah, sure. Yeah. Even on the phone. N some new one or From that C Cohen's? Oh, uh-huh. And but which uh phone it is? F Samsung, something? Where uh some but some which you can buy, right? It's It and you can train it, like only your speech or it's general uh it's quite interesting, yeah. Okay, uh-huh. Or you you can just buy that application, right? I don't know, like it's like you you've got some mobile phone, and then you can buy from another company some such application wh you can dictate your S_M_S_ or whatever. S I don't know if it is done now, but like But have you ever used like dictation system for Windows? Those uh there were many of them. Yeah, drag the mouse there. And I heard that it worked very very well. Um Alright. Right, newer speech, okay. Okay. How to say okay. Mm-hmm. I think, yeah yeah, definitely. It's based on the diphones diphones, right? Yeah, yeah. Yeah, I think it's very good, yeah. Mm-hmm. So you've got it installed in your machine here or Festival, right? Okay. Oh. A and have you have you tried it? And? So you can Uh-huh. Mm-hmm. Yeah, it's very difficult and Yeah, yeah. And you just uh have chosen some text? Any text you just put in and Okay. And it can read uh how does it read, from which um from which uh how to state it? Uh is it uh possible to read P_D_F_s or just standard text? Whatever, T_X_T_, right, files. Uh-huh. Yeah. Well let's Oh really? Anyway, who is the author of Festival? Edinburgh? And I thought it that's the guy from C_S_L_U_, from Portland, who was there during f Yeah, but there was No, no, but that that was the guy who's on a wheelchair now, he cannot m move even, he's very in like indic how to say that? Like he cannot have he just can speak only through some Festival system or something like that. 
He was the auth no, yeah, h he was working I think in w O_G_I_ even, he's very famous, but I thought that he was the author no, it was developed for him, like he was just first tester I think of the system. Something like that. Like it's not t text to speech, it's just the uh uh synthesis, which means which is more or less the same, but I dunno. Yeah. So it's somebody and Simon King is playing with that, right, the t guy. Okay, okay. I think that's enough. No other papers it's just one paper and and Festival. From Linux, it's very im interesting. Yeah.
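Editorial note: the earlier remark about a full covariance matrix of dimension thirty nine times thirty nine can be put in numbers. The sketch below is only an illustration; the function names and the mixture size of sixteen components are assumptions, not figures from the discussion.

```python
def diag_cov_params(dim):
    """Free parameters of a diagonal covariance: one variance per dimension."""
    return dim


def full_cov_params(dim):
    """Free parameters of a symmetric full covariance: upper triangle plus diagonal."""
    return dim * (dim + 1) // 2


if __name__ == "__main__":
    dim = 39          # feature dimension mentioned in the discussion
    components = 16   # hypothetical number of Gaussians per state (assumption)
    print("diagonal, per Gaussian:", diag_cov_params(dim))
    print("full, per Gaussian:", full_cov_params(dim))
    print("full, per state:", components * full_cov_params(dim))
```

Per Gaussian the full matrix needs twenty times as many covariance parameters as the diagonal one, which is in line with the concern that such models are hard to train and costly to evaluate.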
And it's always the other way around, that's how you put it on. I think the the most of the bad words are at the beginning when the people try to put in on. Huh. Whoo. Ah, I cannot put it. Oh no no, not at all. Well Geneva, that's why I'm here at IDIAP, because I attended Geneva Eurospeech. I c Why, why you cannot? Mm-hmm. Exactly. Meet Okay, so you attended mostly posters, I guess. So how many presentation you you I realised it were that there were panel discussions only at the place where I could see the the t t you know, there were some boards saying that there is some panel discussion, uh but I I couldn't get no one knew where is it, what it is, where it happens and why and so forth. It was com Yeah. Finally I found, but Well, this is not this speech is not restricted only to papers and these things, you know. So how did you like Lisbon then? Yeah. Yeah, w we went to the Costa Caparica there, with the bus. We we were we were supposed to take take the direct one, but we didn't. North? Okay, so you stayed at the at the same coast, not not you didn't go to the island, cross the river where where is the bridge and and then okay, so that was a different place. Yeah, with Hamed uh Hemant and these people. The more south to th the hole through the water. Huh. I was I was very surprised that even the trams did have a A_C_ there. I couldn't see it at at all. Yeah, r right, but can you imagine something like in? Uh it's A_C_. Well, it it was much better if if if you would have to spend a hour and a half that uh Hemant and the others spent while going with a ca uh with a bus to to the conference centre from from the the hotel that we were. There was a direct bus and it took them hour and a half then. Mm. Yeah. Yeah, that was the right choice, but to take something which which was uh direct, in quotes. It was n not the f direct as anything else, but. Uh it sounds sounds like. Let's go on, this was just a Of what? Pruning of covariance matrix. 
But what what does it mean tied then? Uh yeah, but then you have to store the whole matrix and plus some extra information what is tied to what or b Yeah, but Uh I don't know about anything like this, but Tying. Mm. But ooh the the main point is to to uh like make it faster, the the the decoding, or what? Well that's possible, why why it should be i impossible? So it's completely impossible to do it with full covariance matrix? Why? Yeah, computationally e expensive, but Of course, from the point of view of computation, but otherwise it's there's no problem, no? with okay. Okay, so so here you have to choose either many many things to store and huge computation uh Yeah, yeah. No. Uh depends. So. So there are no results there? No Hmm yeah. Uh-huh. Yeah. Yeah. Mm-hmm. Okay. Why it has to be online? Okay. Like extended as as more as much as possible to keep the okay. Of course, of course. Whoo. Yeah, it's right. This is maybe because of F_F_T_U_ or these things. It's no not as small as as it should have been, but mayb maybe because of other processing after. No? Yeah. Well noisy, yes. Overlap you mean. Uh w how do you mean to say? Uh Um I have two questions, first is essentially how he's doing in that. No, well uh once he gets the frame, like from here to here, then where he advances to to start with c computation of the next frame? Yeah, mm-hmm. Okay. Okay. Okay. Mm-hmm. Fu ok okay, so do what do you mean full? Some some sub-window or the the whole ten seconds of speech or what? It has to be also some somehow Okay, something longer window than the one we are speaking about, okay? Uh w Okay, so you have a window and then you split it to s two parts and you shift the the point where you split it or okay. Okay, splitting point, okay. Okay. Okay, so basically we have two fixed points. Yeah, okay. Okay. Okay, so once you have say one hour of speech, you start at the beginning, and first you try with twenty milliseconds. 
Then you shift to twenty five and so forth until you Okay, and and you find the point where where what? Where where do you stop actually? Yeah, mm I know, but you start at twenty and go to sixty, but where which point do you choose? Uh okay, but which which likelihood ratio? Okay. Yeah, but uh likelihood, that's something and something. W what is likelihood? Okay. Yeah, but it has to yeah. Okay. So it's basic So Mm. Yeah. Yeah, but still. If Mm Yeah, but why do the main value which are comparing to, the the the overall value or overall likelihood or whatever you call it, why he's using the whole signal for this? Because so if the whole signal was uh steady somehow and it it would be able to be modelled by L_P_ model, well, then the residue will be very low. So he's supposing that the whole signal won't be able to to be captured by L_P_ model and that the residue would be high enough or ho wh why he's comparing the o to to the overall? You know, I I don't see the point of dividing over the the whole stuff. Y well, sounds sounds reasonable. I see, I see, I see. Okay, that's reasonable. Because actually if it if if uh the whole signal would be able to be modelled by L_ uh by L_P_C_ then anyway he has to design some some framing. So even if it's very stationary by d um by dividing over this whole stuff, he is able to find some reasonable boundaries. Um m ma makes sense um. Okay. And Mm. That has to be power, not energy definitely, no? Yeah, but it has to be normalised. By the length. Then power, instead of energy. Energy over length is power. Yeah, yeah. Duh duh duh duh duh. Okay. So but this so two things I still don't understand, thi this is supposed to be only for M_F_C_C_s and these easy features ah, one question I I wanted to ask before, does does he preserve within the feature vector the framing he's using finally, because does he somehow put the information about was the size about the frame to the feature vector itself or he's throwing away this framing? 
Something like this, because the recogniser could be good for the recogniser to know what was the chosen framing by this extraction. Okay. So so it n uh it might be bad for then for the back end, it might screw up the the t speech rate or normaliser. Huh. Oh, probably I didn't get it too, so the framing is actually equi-distant. Just the windows are wider oh ah, I see. Because it would be crazy, yeah, a little bit. Oh I see, so so it's it's like this. Yeah. I cannot imagine the transformation to get the modulation spectrum out of such a stuff. Uh. Mm. So why don't you still use it? If it works so well. Yeah, but you do the speech coding, so do you still use it? Why? Mm. Uh. Uh it's interesting, this. Well there should be after each conference there should be something like ten people should sit down, read the papers, and then rank them. You know, a all the all the all the similar papers within one session say sp uh focus to this topic, and say this is the best, this works slightly worse than this one, and throw away what is not that good and and th the first three three papers which are the best shou should be implemented and used from from that time onwards, like forget the all the M_F_C_C_s, because it's too old, and then start building the story on the new stuff and not 'cause it's Yeah, it is but I know, I know, uh th this is just a theory. Because Mm yeah. Yes, and I want to yeah. Yeah. It looks like Mm. Yeah? Yeah. Because to me it seems that everyone is still comparing to P_L_P_s or M_F_C_C_s, and there are so many new systems and everything is new, but it baseline is still uh t twenty years old and Yeah? Uh yeah. No okay, so So it should be every papers then should contain all the results from NIST evaluation or some standard task then. Otherwise Uh uh that's the other yeah. Okay, so it could be you you would have to publish the results on this standard task, and then you could say well, I also tried on this and this Yeah. 
You know, but then then only it makes some progress, this. Mm. Yeah, but can for for example can you use from some data from from the L_T_ world? For training say? Yeah, but they give you also training data or only the testing data? Oh, i uh since this is only okay. Okay, so so they give you only the training or development data only? Mm-hmm. Yeah. Hmm. Mm. Hmm. Mm-hmm. Mm. Uh. Al almost. Yeah. We are a couple of people at IDIAP and there are already two s two different setups. Stories and numbers, or Yeah. Uh I think we use di distinct train and test set, even se the different sets of phonemes. Different Well, I don't use only digits. My M_L_P_s are trained on everything and uh test set is digits, yeah. For for what? Ah. Yeah. Yeah, you cannot play with that. Because of the voices wouldn't be comparable then. Uh. Hmm. Because who is the one who makes who actually uses what we d try to do here? These are companies or because it's not scientists, scientists just start from P_L_P_ and then design the new features, new back end, or new anything, but they start from the old stuff. But who is the one who is using these results which we publish and try to make Yeah, but once you once you finish your P_H_D_ you go and it's over. And those who don't? They stay and publish another papers and they but they they don't turn into a life, no? into life. Mm uh-huh. Because I I won Yes. I wonder which level they are using out of what is there. Hmm. Yeah. They grab Hmm. Hmm. Okay, so we can hope that Nokia will be using tandem tandem P_L_P_s. Yeah, I cannot. Huh. So it's not related to ICSI. Yeah, but it was Mm. Hmm. Hmm. Hmm. Because they grab whatever is certain to work, and they they use it. They don't use anything which is to uh too new which is not that safe. 
It's like a it's like a NASA, you know, they use four eight six machines to to to to put two n uh space ships, because it's it's known to work perfectly well or Pentium, not any P_ four or whatever, because it's safe, it has been working for twenty years, and then only it safe to put well not twenty but it's always yeah yeah, of course. Of course. Mm yeah. Mm. Hynek also told about this Cohen ce cell phone, which he was showing to to work l uh with the L_V_C_S_R_. He he was saying that there is something really really uh simple, the decoder I mean, not the features, but he was mentioning something to the decoder. This is just there is almo Yeah, something like no D_T_W_ or something, I don't know, I'm not really sure, but he mention that Hmm? That was impressive. Yeah, but it's that's a similar one to what Herve has, for example some he was showing it uh live on M_L_M_I_ c last M_L_M_I_ conference. You can look at the video and play it uh. It uh I'm not sure. Yes, commercial phone. What do you mean, always? Well, they they just bought a mobile, and based on the O_S_ it's using they they they developed the software. No, this is the development phone, this is not the commercial stuff with the software. They just use the device. By by what? I don't know if it's already the case right now. You had to get used to the system, not the system to you. It's nice. By the way, you should be the one to ask is there anything if I'm just a p private person and I want for free any software which is able to read me a book in English, uh for example that Festival, am I able to put the text to the system and to in real time to listen to the output? Festival is able to to to read just a mm normal plain text in reasonable comprehensible English or so that I can understand it? Non-native speaker? Mm okay. Yeah. And it's in real time on a reasonable machine? Okay. Okay. 
'Cause the the diphone that I was tried to install it and there there are the voices like cat voice and these voices, this is not the one? Because uh I don't know which which um what i what is running where uh if it's diphone or which uh approach this is. Yes? So it's uh just uh wha samples of the diphones, and it choo chooses Yeah, on my on my laptop. Or here. Yeah, you can you can easily yes, I d I've tried, but there are some built-in like easy voices which you can y uh there is a default one completely synthetic, which you cannot understand at all, and th then there is a diphone synthesis I think it is the diphone synthesis from the some like something like cat w voice and somethi this is so-so, but it's this doesn't sound very nice, and we were playing at the uh summer school with them, actually with the authors, and they were playing us some other other approaches, and it it sounded like a human, it was great, but Yeah. So this is not this is not for free or something. But it it's not here so far. Okay, I'm I'm fine with this, but I can make it run on my machine. Okay. Okay. And and the engine is is in the Festival itself. So it's just a voice to to okay. okay. Mm-hmm. Mm mm. Okay. Hmm. Mm-hmm. Hmm. Okay. Uh it's right, I mean Yeah, I just s use the Douglas Adams, uh uh that that book, you know, the mm how it's called? Just general English textbook, I I cho I tried. No, the T_ T_X_T_. Yeah, ASCII file basically. You have to but you can take a P_D_F_ and convert P_D_F_ to ASCII. Yeah. It was a line with hashes and it was hash hash hash hash hash hash hash hash hash hash. Uh I'm not, then I don't want to go too much into details, just a little bit. Mm well I can prepare s sorry. These are the Edinburgh people, no? Mm-hmm, yeah, voice. That I know. Um uh. N uh. N uh. Mm.
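Editorial note: the point made earlier in this turn, that energy has to be normalised by the length and that energy over length is power, matters once frames have different lengths. A minimal sketch, assuming a frame is simply a list of samples; the function names and the sample values are made up for illustration.

```python
def energy(frame):
    """Sum of squared samples; grows with the frame length."""
    return sum(x * x for x in frame)


def power(frame):
    """Energy normalised by the frame length."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return energy(frame) / len(frame)


if __name__ == "__main__":
    short_frame = [0.5] * 160   # hypothetical short frame
    long_frame = [0.5] * 480    # hypothetical frame three times longer
    print(energy(short_frame), energy(long_frame))
    print(power(short_frame), power(long_frame))
```

With the same signal level the longer frame has three times the energy, while the power of the two frames is the same, so only the power is comparable across variable-length frames.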
At this precise time, yeah. Uh you can't attend multiple session at the same time, so. Hmm. Mm-hmm, yeah in. Hmm. Yeah. Hmm. Yeah. Hmm. But yeah. Yeah. I liked the invited speaker um about the implant in the ear. It's quite good, no? I think so, yeah. Yeah. Mm. Hmm. Hmm. No quite good uh. A bit hot still, but the bea the beach nearby is quite nice, yeah. You should try it, I don't know. Did you go around? Before or after, yeah. Okay. Yeah, 'cause I Yeah, I stayed I stayed two days before, so uh I went you ju you just take the train, maybe that's the same, I'm not sure. Train north. Half an hour train, yeah. Yeah. West or north, I don't know uh. No no no, not very far. No no. It's different, yeah. Yeah. Yeah. Hmm. Hmm. Well y We can take the tram and a train. It's it's all conditioned. Yeah. Ah. Mm.. Ah, sorry. No no no it's uh or s I'm not sure. Yeah. Yeah, originally I brought it because I marked some papers, but yeah. Onl only o one which surprised me a bit, it was uh speaker recognition I think, but it doesn't really matter. They were trying to model uh um covariance of different components for G_M_M_, instead of using a diagonal. But then if you use a full it's too much. So they use some approach where they tie them automatically and uh after that well, it's more like tying it is still full, but different um components are tied, different components of the matrix are tied. Equal. Uh No no, you have to store a minimum number a reduced number of parameters, and they are tied in a linear way actually, sorry, not equal, but it doesn't Yeah, exactly, that's it uh. So they ju yeah. Yeah. Interesting thing is that they applied it on speaker recognition, and on the features before they had three different ways to um decorrelate the features, and and this they showed it uh without this, with diagonal G_M_M_ or with this, and with this, with uh semi-tied covariance matrices, it was a less much less sensitive to uh which um decorrelation procedure you choose. If y so. 
A a also better, but of course it depends on the number of of parameters you put Yeah. Yeah. Yeah, I I think I think it was even slightly better, yeah. Maybe I can find it. L_D_A_, yeah, L_D_A_. Make it possible to use uh a full covariance matrix. Yeah. Yeah it's this one. It's i improved covariance modelling. Yeah. No, but in the proceedings you can find them. They are better, yeah. They tried P_C_A_, L_D_A_, M_L_L_T_. I i it's not really anything new, but uh the mm they just applied that to speaker I_D_. Yeah. Yeah. I isn't it smaller, the minimum, no? I don't know, I just Mm. Mm. Yeah, okay. But it's not uh an intrinsic limitation, it's just because of ri yeah. Okay. So it's not n it's not symmetric necessarily? 'Kay. So he's maximising that or mm Mm. Mm-hmm. Hmm. Yeah. Mm. Okay. Mm-hmm. Yeah, ho yeah. How did you cope with what you mentioned some time before, that if it's a very long session for each twelve point five frame you will actually get the same segment now. So how d Ah, okay. Mm-hmm. Mm. Sorry. Mm uh what it's a little bit what NIST is trying to do, no? So. Yeah, bu Yeah, but then we all optimise uh on the same task. So you cannot get anything new out of that after some time. Yeah, but then you run out of time. No, i ideally you're correct, uh but Mm. So yeah. Ho how how do they how do they evaluate? I just don't know. So they ask peop they ask people to sit and listen basically? Yeah? Hmm. Okay. But they d they don't have any measure like you have an original speech, you transcribe it and Okay. Hmm. Hmm. Hmm. Mm. So maybe well, but then well no, if somebody like NIST will have one recogniser and you just plug a different feature, but uh it's dangerous. Right. Mm. Hmm. Yeah. Yeah. Hem Hemant is using difference Seems Hmm. Mm. Yeah. Would be mm would be nice for Aurora maybe to have that. Aur Aurora. Yeah. No no, but yeah, yeah. Mm. Hmm. Yeah. Hmm. Yeah, Daimler, yeah. I don't know. You know, they're extremely secretive. 
I had a hard time just to get the signals. So I asked other questions but uh it's quite tough, yeah. Yeah. Yeah, they have to yeah. They have yeah, they have to do some test in their Mercedes, some speech recognition test, mm but it seems now they mm at least Daimler, they mostly work on dialogue management, uh but not much on features and recognition. No. Yeah. I d I mm. I doubt they are trying right now. I doubt that at Daimler they are trying right now, but maybe other companies, mm. Uh. Cohen? Jorda Jordan Cohen. No no. Mm. Mm. But there you could send an S_M_S_ I think, or something like that. Uh you could dictate text, so. On a phone, yeah. The phone that uh Cohen yeah, VoiceSignal. Yeah. I don't know, that's Okay. Hmm. So it's still still real time? Hmm. Ah okay, but but not much, yeah. Yeah, yeah. Alright. Yeah, yeah. Yeah. Okay.
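Editorial note: the discussion settles on the framing being equidistant, with only the windows getting wider. A minimal sketch of that idea follows, assuming a fixed shift between frame centres and a per-frame window length chosen by some other procedure. The shift of ten milliseconds, the example lengths and the function name are assumptions for illustration only.

```python
def variable_windows(lengths_ms, shift_ms, total_ms):
    """Return (start_ms, end_ms) for each frame.

    Frame centres are equidistant (one every shift_ms); only the
    window width around each centre varies. Windows are clipped to
    the signal boundaries [0, total_ms].
    """
    windows = []
    for i, length in enumerate(lengths_ms):
        centre = i * shift_ms
        start = max(0.0, centre - length / 2)
        end = min(total_ms, centre + length / 2)
        windows.append((start, end))
    return windows


if __name__ == "__main__":
    # Hypothetical window lengths between twenty and sixty milliseconds.
    for start, end in variable_windows([20.0, 60.0, 20.0], 10.0, 1000.0):
        print(start, end)
```

The frame rate stays at one frame per shift whatever lengths are chosen, which is why the back end does not need to know the window sizes.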
